This dataset comes from Kaggle (https://www.kaggle.com/murderaccountability/homicide-reports). The Murder Accountability Project is the most complete database of homicides in the United States currently available. The dataset includes murders from the FBI’s Supplementary Homicide Report from 1980 to 2014 and Freedom of Information Act data on more than 22,000 homicides that were not reported to the Justice Department. It records the age, race, sex and ethnicity of victims and perpetrators, as well as the relationship between the victim and perpetrator and the weapon used.
The following will show the structure of the dataset.
## 'data.frame': 638454 obs. of 24 variables:
## $ Record.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Agency.Code : Factor w/ 12003 levels "AK00101","AK00102",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Agency.Name : Factor w/ 9216 levels "Abbeville","Abbeville County",..: 150 150 150 150 150 150 150 150 150 150 ...
## $ Agency.Type : Factor w/ 7 levels "County Police",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ City : Factor w/ 1782 levels "Abbeville","Acadia",..: 36 36 36 36 36 36 36 36 36 36 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Month : Factor w/ 12 levels "April","August",..: 5 8 8 1 1 9 9 7 7 7 ...
## $ Incident : int 1 1 2 1 2 1 2 1 2 3 ...
## $ Crime.Type : Factor w/ 2 levels "Manslaughter by Negligence",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Crime.Solved : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 2 1 2 ...
## $ Victim.Sex : Factor w/ 3 levels "Female","Male",..: 2 2 1 2 1 2 1 1 2 2 ...
## $ Victim.Age : int 14 43 30 43 30 30 42 99 32 38 ...
## $ Victim.Race : Factor w/ 5 levels "Asian/Pacific Islander",..: 3 5 3 5 3 5 3 5 5 5 ...
## $ Victim.Ethnicity : Factor w/ 3 levels "Hispanic","Not Hispanic",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Perpetrator.Sex : Factor w/ 3 levels "Female","Male",..: 2 2 3 2 3 2 2 2 3 2 ...
## $ Perpetrator.Age : int 15 42 0 42 0 36 27 35 0 40 ...
## $ Perpetrator.Race : Factor w/ 5 levels "Asian/Pacific Islander",..: 3 5 4 5 4 5 2 5 4 4 ...
## $ Perpetrator.Ethnicity: Factor w/ 3 levels "Hispanic","Not Hispanic",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Relationship : Factor w/ 28 levels "Acquaintance",..: 1 1 27 1 27 1 28 28 27 27 ...
## $ Weapon : Factor w/ 16 levels "Blunt Object",..: 1 14 16 14 16 12 10 10 7 7 ...
## $ Victim.Count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Perpetrator.Count : int 0 0 0 0 1 0 0 0 0 1 ...
## $ Record.Source : Factor w/ 2 levels "FBI","FOIA": 1 1 1 1 1 1 1 1 1 1 ...
From the above output, we can see that there are 638,454 observations of 24 variables.
Below are the first 10 observations from the dataset.
## Record.ID Agency.Code Agency.Name Agency.Type City State
## 1 1 AK00101 Anchorage Municipal Police Anchorage Alaska
## 2 2 AK00101 Anchorage Municipal Police Anchorage Alaska
## 3 3 AK00101 Anchorage Municipal Police Anchorage Alaska
## 4 4 AK00101 Anchorage Municipal Police Anchorage Alaska
## 5 5 AK00101 Anchorage Municipal Police Anchorage Alaska
## 6 6 AK00101 Anchorage Municipal Police Anchorage Alaska
## 7 7 AK00101 Anchorage Municipal Police Anchorage Alaska
## 8 8 AK00101 Anchorage Municipal Police Anchorage Alaska
## 9 9 AK00101 Anchorage Municipal Police Anchorage Alaska
## 10 10 AK00101 Anchorage Municipal Police Anchorage Alaska
## Year Month Incident Crime.Type Crime.Solved Victim.Sex
## 1 1980 January 1 Murder or Manslaughter Yes Male
## 2 1980 March 1 Murder or Manslaughter Yes Male
## 3 1980 March 2 Murder or Manslaughter No Female
## 4 1980 April 1 Murder or Manslaughter Yes Male
## 5 1980 April 2 Murder or Manslaughter No Female
## 6 1980 May 1 Murder or Manslaughter Yes Male
## 7 1980 May 2 Murder or Manslaughter Yes Female
## 8 1980 June 1 Murder or Manslaughter Yes Female
## 9 1980 June 2 Murder or Manslaughter No Male
## 10 1980 June 3 Murder or Manslaughter Yes Male
## Victim.Age Victim.Race Victim.Ethnicity
## 1 14 Native American/Alaska Native Unknown
## 2 43 White Unknown
## 3 30 Native American/Alaska Native Unknown
## 4 43 White Unknown
## 5 30 Native American/Alaska Native Unknown
## 6 30 White Unknown
## 7 42 Native American/Alaska Native Unknown
## 8 99 White Unknown
## 9 32 White Unknown
## 10 38 White Unknown
## Perpetrator.Sex Perpetrator.Age Perpetrator.Race
## 1 Male 15 Native American/Alaska Native
## 2 Male 42 White
## 3 Unknown 0 Unknown
## 4 Male 42 White
## 5 Unknown 0 Unknown
## 6 Male 36 White
## 7 Male 27 Black
## 8 Male 35 White
## 9 Unknown 0 Unknown
## 10 Male 40 Unknown
## Perpetrator.Ethnicity Relationship Weapon Victim.Count
## 1 Unknown Acquaintance Blunt Object 0
## 2 Unknown Acquaintance Strangulation 0
## 3 Unknown Unknown Unknown 0
## 4 Unknown Acquaintance Strangulation 0
## 5 Unknown Unknown Unknown 0
## 6 Unknown Acquaintance Rifle 0
## 7 Unknown Wife Knife 0
## 8 Unknown Wife Knife 0
## 9 Unknown Unknown Firearm 0
## 10 Unknown Unknown Firearm 0
## Perpetrator.Count Record.Source
## 1 0 FBI
## 2 0 FBI
## 3 0 FBI
## 4 0 FBI
## 5 1 FBI
## 6 0 FBI
## 7 0 FBI
## 8 0 FBI
## 9 0 FBI
## 10 1 FBI
All the variable names are fairly self-explanatory. A summary of each variable follows.
## Record.ID Agency.Code Agency.Name
## Min. : 1 NY03030: 38416 New York : 38416
## 1st Qu.:159614 CA01942: 23663 Los Angeles : 29007
## Median :319228 ILCPD00: 21331 Chicago : 21331
## Mean :319228 MI82349: 17206 Detroit : 17206
## 3rd Qu.:478841 TXHPD00: 12881 Houston : 13046
## Max. :638454 PAPEP00: 12848 Philadelphia: 12861
## (Other):512109 (Other) :506587
## Agency.Type City State
## County Police : 22693 Los Angeles : 44511 California: 99783
## Municipal Police:493026 New York : 38431 Texas : 62095
## Regional Police : 235 Cook : 22383 New York : 49268
## Sheriff :105322 Wayne : 19904 Florida : 37164
## Special Police : 2889 Harris : 16331 Michigan : 28448
## State Police : 14235 Philadelphia: 12851 Illinois : 25871
## Tribal Police : 54 (Other) :484043 (Other) :335825
## Year Month Incident
## Min. :1980 July : 58696 Min. : 0.00
## 1st Qu.:1987 August : 58072 1st Qu.: 1.00
## Median :1995 December : 55187 Median : 2.00
## Mean :1996 September: 54117 Mean : 22.97
## 3rd Qu.:2004 June : 53662 3rd Qu.: 10.00
## Max. :2014 October : 53650 Max. :999.00
## (Other) :305070
## Crime.Type Crime.Solved Victim.Sex
## Manslaughter by Negligence: 9116 No :190282 Female :143345
## Murder or Manslaughter :629338 Yes:448172 Male :494125
## Unknown: 984
##
##
##
##
## Victim.Age Victim.Race
## Min. : 0.00 Asian/Pacific Islander : 9890
## 1st Qu.: 22.00 Black :299899
## Median : 30.00 Native American/Alaska Native: 4567
## Mean : 35.03 Unknown : 6676
## 3rd Qu.: 42.00 White :317422
## Max. :998.00
##
## Victim.Ethnicity Perpetrator.Sex Perpetrator.Age
## Hispanic : 72652 Female : 48548 Min. : 0.00
## Not Hispanic:197499 Male :399541 1st Qu.: 0.00
## Unknown :368303 Unknown:190365 Median :21.00
## Mean :20.32
## 3rd Qu.:31.00
## Max. :99.00
## NA's :1
## Perpetrator.Race Perpetrator.Ethnicity
## Asian/Pacific Islander : 6046 Hispanic : 46872
## Black :214516 Not Hispanic:145172
## Native American/Alaska Native: 3602 Unknown :446410
## Unknown :196047
## White :218243
##
##
## Relationship Weapon Victim.Count
## Unknown :273013 Handgun :317484 Min. : 0.0000
## Acquaintance:126018 Knife : 94962 1st Qu.: 0.0000
## Stranger : 96593 Blunt Object: 67337 Median : 0.0000
## Wife : 23187 Firearm : 46980 Mean : 0.1233
## Friend : 21945 Unknown : 33192 3rd Qu.: 0.0000
## Girlfriend : 16465 Shotgun : 30722 Max. :10.0000
## (Other) : 81233 (Other) : 47777
## Perpetrator.Count Record.Source
## Min. : 0.0000 FBI :616647
## 1st Qu.: 0.0000 FOIA: 21807
## Median : 0.0000
## Mean : 0.1852
## 3rd Qu.: 0.0000
## Max. :10.0000
##
The above summary shows the counts for each variable; next, some of them are shown visually in plots.
Before drawing the univariate plots, I clean some of the variable entries to make them more consistent and to remove as much inaccurate data as possible.
I replaced each state name with its abbreviated state code, changed victim ages of 998 to ‘NA’, changed the perpetrator age of unsolved crimes from 0 to ‘NA’, and finally grouped gun, handgun, firearm, shotgun and rifle as ‘Gun/Firearm’.
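A minimal sketch of these cleaning steps, run on a tiny made-up data frame so that it is self-contained. The column names match the dataset, but the toy rows, the lookup built from base R's `state.name`/`state.abb`, and the exact recoding rules are assumptions for illustration, not necessarily the code used for this report.

```r
# Toy stand-in for the real data (values are made up for illustration).
ushomi <- data.frame(
  State           = c("Alaska", "Texas", "District of Columbia"),
  Crime.Solved    = c("Yes", "No", "Yes"),
  Victim.Age      = c(14L, 998L, 30L),
  Perpetrator.Age = c(15L, 0L, 0L),
  Weapon          = c("Handgun", "Knife", "Shotgun"),
  stringsAsFactors = FALSE
)

# 1. Full state name -> two-letter code. state.name/state.abb ship with
#    base R and cover the 50 states only, so DC is added by hand.
#    as.character() guards against the column being a factor.
lookup <- c(setNames(state.abb, state.name), "District of Columbia" = "DC")
ushomi$State <- unname(lookup[as.character(ushomi$State)])

# 2. A victim age of 998 is a placeholder, not a real age.
ushomi$Victim.Age[ushomi$Victim.Age %in% 998] <- NA

# 3. A perpetrator age of 0 on an unsolved case is treated as unknown.
unsolved_zero <- ushomi$Crime.Solved == "No" & ushomi$Perpetrator.Age %in% 0
ushomi$Perpetrator.Age[unsolved_zero] <- NA

# 4. Collapse the separate firearm categories into a single label.
guns <- c("Gun", "Handgun", "Firearm", "Shotgun", "Rifle")
ushomi$Weapon <- as.character(ushomi$Weapon)
ushomi$Weapon[ushomi$Weapon %in% guns] <- "Gun/Firearm"

print(ushomi)
```

Note that step 3 leaves the age at 0 for solved cases (the third toy row), which is why a spike at age 0 can remain after cleaning.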
Below is the bar chart showing homicide counts for each state.
We can see that California, Texas and New York have the highest homicide counts. This is no surprise, as these are among the most populous states in the US.
The bar chart below shows the number of cases handled by each agency type.
Most homicide cases fall under the Municipal Police, followed by the Sheriff’s department.
The distribution of homicides by month above is fairly uniform. February has a lower count than the other months, most likely because it has only 28 or 29 days.
The plot below shows a frequency polygon of homicide counts by year.
There have been frequent ups and downs in the homicide count since the 1980s, but overall there is a decreasing trend. The number of homicides has fallen dramatically since the early 1990s.
The plot below shows the counts for ‘manslaughter by negligence’ and ‘murder or manslaughter’.
Most homicides were of the murder or manslaughter type.
The plot below shows the counts of crimes that have been solved versus crimes that are still unsolved.
About 70% of the homicides (448,172 of 638,454) were solved.
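As a quick check of that share, the proportions can be computed directly from the `Crime.Solved` counts in the summary output above. This is only a small arithmetic sketch; the names `solved_counts` and `solved_share` are made up for illustration.

```r
# Counts of Crime.Solved copied from the summary() output above.
solved_counts <- c(No = 190282, Yes = 448172)

# Share of each outcome among all records, rounded to three decimals.
solved_share <- round(prop.table(solved_counts), 3)
print(solved_share)
```

This gives roughly 0.70 solved and 0.30 unsolved.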
Below is the distribution of the victims by sex.
The majority of the victims are male (a little under 500,000), while there are fewer than 150,000 female victims. There are also some cases where the victim’s sex wasn’t or couldn’t be identified.
The histogram below shows the age distribution of all the victims.
The distribution shows that the majority of the victims were between their late teens and their mid 30s. We can also see a high spike at age 99, with about 9,000 victims of that age; I suspect this has been inaccurately recorded. There is a similar spike at 0, possibly for the same reason. This plot excludes the 974 entries where the age was recorded as ‘998’ and later set to ‘NA’.
The bar chart shows all victims by race.
As expected, the majority of the victims were white or black, since these two groups make up a large majority of the American population.
The plot below shows perpetrators grouped by sex.
From the bar chart we can see that there are far more male perpetrators than female. There is also a high number of perpetrators of unknown sex, from cases that are still unsolved.
Below is the histogram showing the age distribution of the perpetrators.
Even after changing perpetrators’ ages from 0 to ‘NA’ for unsolved crimes, there is an unusually high count of perpetrators aged 0. It doesn’t make sense for about 27,000 babies to be murderers, so I conclude that some of the age entries are unreliable. Maybe the age was left at 0 for perpetrators whose age could never be confirmed, and/or nobody updated it once it was known.
The univariate plot below shows the perpetrators by race.
Black and white people make up most of the perpetrators, as they also make up a majority of the population.
The graph below shows the distribution of the different relationships of the victim to the perpetrator.
Of the crimes that were solved, acquaintances, strangers, wives and friends were among the most common victim relationships. It is important to note that, on closer inspection of the dataset, the relationship is sometimes inconsistent with the sex or the age of the victim. For example, some victims aged 0 or 1 show up as mothers or wives. I believe that in some cases the relationship of the perpetrator to the victim was recorded instead of the other way round.
Below is a graphic representation of the weapons that were used in the crimes.
‘Gun/Firearm’, ‘Knife’ and ‘Blunt Object’ make up the majority of the weapons used across all the cases.
There is a bar chart below showing the frequency of perpetrators by sex.
Clearly, a majority of the perpetrators are male: there are about 400,000 male and about 50,000 female perpetrators, roughly 8 times as many.
The dataset has 638,454 observations of 24 variables. The variables are shown below.
Record.ID - A unique row number
Agency.Code
Agency.Name
Agency.Type - (e.g., county police, sheriff)
City
State - 50 states and DC
Year - 1980-2014
Month - January-December
Incident - Counting the incidents in an area per month
Crime.Type - Murder or Manslaughter or Manslaughter by Negligence
Crime.Solved - Yes or No
Victim.Sex - Female, Male, or Unknown
Victim.Age - 0-99 or 998 (fixed 998 to NA)
Victim.Race - (e.g., White, Black, Asian/Pacific Islander, etc.)
Victim.Ethnicity - Hispanic, Not Hispanic or Unknown
Perpetrator.Sex - Female, Male, or Unknown
Perpetrator.Age - 0-99 (fixed unsolved cases with Perpetrator age = 0 to NA)
Perpetrator.Race - (e.g., White, Black, Asian/Pacific Islander, etc.)
Perpetrator.Ethnicity - Hispanic, Not Hispanic or Unknown
Relationship - Relationship of the victim to the perpetrator (sometimes inconsistent)
Weapon - Weapon used
Victim.Count - Number of victims in addition to the one described
Perpetrator.Count - Number of perpetrators in addition to the one described
Record.Source - FBI or FOIA (Freedom of Information Act)
The homicide cases are distributed fairly uniformly by month but have decreased in count over time, especially after the early 1990s. The majority of the victims and perpetrators are male. The age distributions of victims and perpetrators are very similar, with the majority between their late teens and early 30s; the counts go down with increasing age. The most common weapons were handguns, knives and blunt objects. Firearm, shotgun, gun and rifle are recorded as separate categories, so it makes sense to group them together. Among the races, most victims and perpetrators are black or white, and both races are about equally represented. This makes sense as well, because they make up the majority of the population. Most crimes are murder or manslaughter rather than manslaughter by negligence, and about 70% of the crimes have been solved. Most homicides were committed in California, Texas and New York, which wasn’t surprising as they are among the most populous states. The relationship of the victim to the perpetrator is an interesting variable: the highest victim counts are for Acquaintance, Stranger, Wife and Friend. This variable is sometimes inconsistent and records the relationship of the perpetrator to the victim instead of the other way round. Most cases were under the Municipal Police or the Sheriff.
The main feature of interest for me is the solved status of the crime. I would like to see whether other features such as weapon, state, race and victim identification affect the solve rate.
I would like to see if Victim Race, Victim Sex, Victim Age, Weapon, State, Agency, Year, Victim Count and Perpetrator count affect the proportion of cases solved.
I created a variable called totalvictims, which adds 1 to Victim.Count. I will also create a few new datasets below by grouping, so that each has a new variable recording the proportion of cases solved within that group.
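Because `Victim.Count` records only the victims in addition to the one described, the new variable is a one-line derivation. A minimal sketch on made-up values (the toy data frame is an assumption; only the column names come from the dataset):

```r
# Toy rows; Victim.Count counts victims *in addition to* the one described.
ushomi <- data.frame(Victim.Count = c(0L, 0L, 2L, 10L))

# Total victims per record = the described victim + the additional ones.
ushomi$totalvictims <- ushomi$Victim.Count + 1L

print(ushomi)
```

A `Victim.Count` of 10, the maximum in the summary, therefore corresponds to 11 total victims.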
We had a few interesting data points, for example a lot of victims recorded as 0, 99 or 998 years of age. I changed ‘998’ to ‘NA’ but had no way to tell whether victims recorded as 99 years old were correct. Similarly, we had a lot of perpetrators recorded as 0 years of age, some of them in unsolved cases; I changed those to NA as well. Despite that, the counts remained high, leading me to suspect that many of the age entries are not accurate. I also grouped gun, firearm, shotgun, rifle and handgun under ‘Gun/Firearm’ to get a better understanding of the weapon used.
Below I group the cases by State and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved in each state.
# Solve rate per state
homi_by_state <- ushomi %>%
  group_by(State) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by Year and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved in each year.
# Solve rate per year
homi_by_Year <- ushomi %>%
  group_by(Year) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by victim sex and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved for each sex.
# Solve rate per victim sex
homi_by_Victimsex <- ushomi %>%
  group_by(Victim.Sex) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by the type of agency working on them and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved by each agency type.
# Solve rate per agency type
homi_by_Agency <- ushomi %>%
  group_by(Agency.Type) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by victim age and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved for each age.
# Solve rate per victim age
homi_by_Victimage <- ushomi %>%
  group_by(Victim.Age) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by victim race and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved for each race.
# Solve rate per victim race
homi_by_Victimrace <- ushomi %>%
  group_by(Victim.Race) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by the weapon used and count how many were solved, how many were not, and how many victims in total were affected by that weapon, then use the solved and unsolved counts to calculate the proportion of cases solved for each weapon.
# Solve rate and total victims per weapon
homi_by_weapon <- ushomi %>%
  group_by(Weapon) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved),
            tot_victims = sum(totalvictims))
Below I group the cases by victim count and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved for each group.
# Solve rate per number of additional victims
homi_by_Victimcount <- ushomi %>%
  group_by(Victim.Count) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
Below I group the cases by perpetrator count and count how many were solved and how many were not, then use those counts to calculate the proportion of cases solved for each group.
# Solve rate per number of additional perpetrators
homi_by_Perpcount <- ushomi %>%
  group_by(Perpetrator.Count) %>%
  summarise(n = n(),
            solved = length(Record.ID[Crime.Solved == 'Yes']),
            unsolved = length(Record.ID[Crime.Solved == 'No']),
            prop_solved = solved / (solved + unsolved))
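The grouped summaries above all follow the same pattern, so they could also be produced by one small helper. The sketch below is only an illustration: `solve_rate_by()` is a hypothetical function that is not part of the original analysis, the toy data are made up, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: group by the given column(s) and compute the
# number of cases, solved and unsolved counts, and the solve rate.
solve_rate_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(
      n           = n(),
      solved      = sum(Crime.Solved == "Yes"),
      unsolved    = sum(Crime.Solved == "No"),
      prop_solved = solved / (solved + unsolved),
      .groups     = "drop"
    )
}

# Made-up rows standing in for the real data.
toy <- data.frame(
  State        = c("AK", "AK", "AK", "TX", "TX"),
  Crime.Solved = c("Yes", "Yes", "No", "Yes", "No")
)

by_state <- solve_rate_by(toy, State)
print(by_state)
```

With the real data, a call such as `solve_rate_by(ushomi, Agency.Type)` would play the role of `homi_by_Agency`. Counting with `sum()` on a logical comparison is equivalent to the `length(Record.ID[...])` form as long as `Crime.Solved` has no missing values.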
Below I start making some bivariate plots using some of the newly created grouped sets.
Below is a plot showing the proportion of cases solved by each year and the number of cases from each year.
##
## Pearson's product-moment correlation
##
## data: prop_solved and n
## t = 0.19296, df = 33, p-value = 0.8482
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3030652 0.3627600
## sample estimates:
## cor
## 0.03357196
At first glance the plot seems to suggest an inverse relationship between the solve rate and the number of cases in a year. That would make sense, as police resources get more strained when the number of homicides goes up. The solve rates of the early to mid 1980s are still the highest, and the solve rate seems to have declined since 2010 despite the falling homicide numbers. To investigate further, I ran a correlation test between the homicide count and the proportion of homicides solved for each year. The correlation is about 0.034 (p-value 0.85), which shows that there is almost no linear correlation between the two variables when grouped by year.
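To make the test output easier to read: for Pearson's test the t statistic follows from the correlation r and the degrees of freedom as t = r * sqrt(df) / sqrt(1 - r^2). The sketch below first plugs in the values reported above, then runs `cor.test()` on a made-up yearly summary (the toy numbers are assumptions, not the real yearly figures).

```r
# Part 1: recompute the t statistic from the reported correlation and df.
r  <- 0.03357196   # reported Pearson correlation
df <- 33           # 35 years minus 2
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
print(c(t = t_stat, p = p_value))

# Part 2: the same kind of test on a made-up yearly summary.
toy_by_year <- data.frame(
  n           = c(120, 150, 90, 200, 170, 110),
  prop_solved = c(0.72, 0.68, 0.75, 0.66, 0.70, 0.71)
)
res <- cor.test(toy_by_year$prop_solved, toy_by_year$n)
print(res)
```

The first part should reproduce the t and p-value printed in the output above, up to rounding of r.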
Below is a bar chart showing the proportion of cases solved in each state.
##
## Pearson's product-moment correlation
##
## data: prop_solved and n
## t = -3.2065, df = 49, p-value = 0.002367
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6207952 -0.1591408
## sample estimates:
## cor
## -0.41646
From the plot above we can see that some states do better than others, while some areas such as DC have a very low rate of solving cases. Some of the states with very high closure rates are North Dakota, Montana, South Dakota, South Carolina, Idaho and Wyoming; all of them have closure rates above 90%. I believe solve rates are inversely related to the population and population density of the area, because the places at the lower end here are some of the most densely populated while the ones at the higher end have lower population densities. The correlation test between the number of cases in a state and the state’s closure rate shows a negative correlation of -0.416 (p-value 0.002), although it is not very strong.
The plot below shows solve rates by each Agency and gives us a quick overview of the most and least efficient agencies.
##
## Pearson's product-moment correlation
##
## data: prop_solved and n
## t = -1.098, df = 5, p-value = 0.3222
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8963179 0.4674435
## sample estimates:
## cor
## -0.4407716
The plot and correlation test above show a negative correlation (-0.44) between the number of cases under an agency type and the solve rate for that agency type. With only seven agency types, however, this correlation is not statistically significant (p-value 0.32), and the confidence interval includes zero. Tribal and State police have the highest closure rates, while the County and Municipal police have the lowest.
The plot below shows solve rates by Victim Sex.
The bar plot above shows that the solve rate is higher for cases where the victim is female (about 77%) than for cases where the victim is male (about 68%). The solve rate is much lower (about 34%) when the victim’s sex hasn’t been identified, which is not surprising: it makes it a lot harder for the police to work out where to look for the perpetrator, motive, people of interest, etc.
The plot tells us if the race of the victim has any effect on the solve rate of the case.
From the plot above, it doesn’t look like the race of the victim has a big influence on the outcome of the case. The highest solve rate is for Native American/Alaska Native victims, probably because many of these cases fall under Tribal Police jurisdictions, which have low population densities and the highest solve rate of any agency type. Among the other races, cases with black victims have a solve rate of 66%, while cases with Asian and white victims have solve rates of 70% and 74% respectively.
The plot below shows solve rates by Victim age.
##
## Pearson's product-moment correlation
##
## data: prop_solved and n
## t = -3.1011, df = 99, p-value = 0.002512
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4658968 -0.1084164
## sample estimates:
## cor
## -0.2975509
Newborns and young children have a high solve rate (85-90%), but the proportion of solved cases gradually dips as victims get older, reaching a low for victims between their late teens and early 30s. After that it rises again until about the early 40s and stays fairly constant at around 70-75% for victims up to their mid 90s. For victims in their late 90s the solve rate dips suddenly, reaching the lowest rate of all at age 99. The pattern of the solve rate looks negatively correlated with the number of cases at each age. For example, earlier we saw that people between their late teens and mid 30s made up the majority of the victims, and for this group the solve rate was comparatively low; it was similarly very low for the surprisingly high number of cases recorded at age 99. I ran a correlation test to check this assumption, and there was indeed a negative correlation, but it wasn’t strong (-0.30), probably because of the fairly constant solve rate for victims between their 40s and mid 90s.
The plot tells us if the choice of weapon has any effect on the solve rate of the case.
The weapon based solve rate shows us that if the victim died because of drugs, falling, drowning or poison there were higher chances of solving the case. If the crime was committed through an unknown weapon, strangulation, fire or gun/firearm the solve rate was comparatively lower. In case of strangulation and an unidentified weapon the closure rate was considerably lower.
The plot tells us if Victim count influenced solve rates in any way.
##
## Pearson's product-moment correlation
##
## data: Victim.Count and prop_solved
## t = 4.0333, df = 9, p-value = 0.002958
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3904242 0.9466197
## sample estimates:
## cor
## 0.8023777
There does seem to be an increasing trend between the solve rate and the number of victims in the case. I ran a correlation test between the variables and it turned out to be 0.80 which means there is indeed a fairly strong correlation between the number of victims and the solve rate of the case. This could be also due to the fact that mass shooting or bombing tragedies have more resources allotted and prioritized.
The graph below plots the proportion of cases solved against the number of perpetrators involved.
##
## Pearson's product-moment correlation
##
## data: Perpetrator.Count and prop_solved
## t = 2.1902, df = 9, p-value = 0.05623
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01583524 0.87870752
## sample estimates:
## cor
## 0.5896409
The bar plot above shows that if the number of perpetrators in addition to the main one is greater than 0, the solve rate is much higher, and almost 100% for bigger groups. The correlation test gives a positive correlation of 0.59, although with a p-value of 0.056 it falls just short of significance at the 5% level. This relationship could also be a consequence of the outcome of the case rather than a cause: the police would most likely know about the involvement of multiple people only if the case has already been solved, which would produce an apparent effect on the solve rate. Or it could be that when multiple people are involved in a crime there are more loose ends, and the police can use this to have one person give up the rest.
Apart from the bivariate plots above, which relate the main feature of interest to other variables, I will also plot some relationships between two other variables. Below is the scatterplot of victim age against perpetrator age. I remove perpetrator ages of 0 and victim ages of 99 because the counts for both are unrealistically high and most likely inaccurately reported.
##
## Pearson's product-moment correlation
##
## data: as.numeric(Perpetrator.Age) and as.numeric(Victim.Age)
## t = 267.14, df = 418140, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3792286 0.3844069
## sample estimates:
## cor
## 0.3818207
There seems to be a slight positive linear relationship between the two variables. The correlation test I ran yields a correlation coefficient of 0.38, which means that there is some positive correlation.
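A minimal sketch of the filtering step followed by the correlation test, using base R on made-up ages. The toy values and the name `ages` are assumptions; the filter mirrors the rule described above (drop perpetrator ages of 0 and victim ages of 99).

```r
# Made-up ages; NA stands for a perpetrator age set to missing in cleaning.
toy <- data.frame(
  Victim.Age      = c(25, 99, 40, 31, 60, 18, 52),
  Perpetrator.Age = c(27, 30,  0, 29, 45, 20, NA)
)

# subset() keeps rows where the condition is TRUE, so rows with a missing
# perpetrator age are dropped along with the sentinel values 0 and 99.
ages <- subset(toy, Perpetrator.Age != 0 & Victim.Age != 99)

# Pearson correlation test on the remaining pairs.
res <- with(ages, cor.test(Perpetrator.Age, Victim.Age))
print(res)
```

On the full dataset the same filter leaves several hundred thousand pairs, which is why even a moderate correlation comes with a very small p-value.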
Below is the heatmap showing the total victim count for each weapon.
## ushomi$Weapon: Blunt Object
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.044 1.000 8.000
## --------------------------------------------------------
## ushomi$Weapon: Drowning
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.169 1.000 5.000
## --------------------------------------------------------
## ushomi$Weapon: Drugs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.101 1.000 7.000
## --------------------------------------------------------
## ushomi$Weapon: Explosives
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 4.326 11.000 11.000
## --------------------------------------------------------
## ushomi$Weapon: Fall
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.042 1.000 2.000
## --------------------------------------------------------
## ushomi$Weapon: Fire
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.295 3.000 11.000
## --------------------------------------------------------
## ushomi$Weapon: Gun/Firearm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.128 1.000 10.000
## --------------------------------------------------------
## ushomi$Weapon: Knife
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 1.07 1.00 6.00
## --------------------------------------------------------
## ushomi$Weapon: Poison
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.581 1.000 10.000
## --------------------------------------------------------
## ushomi$Weapon: Strangulation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.084 1.000 7.000
## --------------------------------------------------------
## ushomi$Weapon: Suffocation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.208 1.000 8.000
## --------------------------------------------------------
## ushomi$Weapon: Unknown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.102 1.000 9.000
The plot above shows that the highest victim counts are in cases involving explosives, as one would intuitively expect. Gun/Firearm has the most cases with mid-sized casualties, with total victim counts of 3-10. Most of these weapon classes have been used to kill 1-3 people in multiple cases, but fire and firearms are the most devastating and prolific weapons, aside from explosives, when it comes to mass casualties.
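The per-weapon summaries printed above have the layout that base R's `by()` produces when `summary` is applied to each group. A small sketch of that kind of call on made-up rows (the toy data and the names `per_weapon` and `max_by_weapon` are assumptions):

```r
# Made-up records: weapon and total victims per record.
toy <- data.frame(
  Weapon       = c("Knife", "Knife", "Explosives", "Explosives", "Fire"),
  totalvictims = c(1, 2, 1, 11, 3)
)

# Five-number summary plus mean of total victims, one block per weapon.
per_weapon <- by(toy$totalvictims, toy$Weapon, summary)
print(per_weapon)

# A compact alternative when only one statistic is needed.
max_by_weapon <- tapply(toy$totalvictims, toy$Weapon, max)
print(max_by_weapon)
```

Comparing the maximum and the mean per weapon is a quick way to see which weapon classes are associated with multi-victim cases.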
The proportion of homicides that were solved definitely varied by Year, State, Agency, Victim age and Victim sex, marginally by Victim race, by Weapon, and quite a lot by Victim count and Perpetrator count. For example, if the victim’s sex couldn’t be identified the solve rate was very low (about 34%), and it was about 50% for victims of unknown race. If strangulation was the weapon or if the weapon was unknown, the solve rate was very low as well. States with higher population density had lower solve rates than states with lower population and less dense dispersion. On the other hand, cases handled by Tribal or State police, cases with higher victim or perpetrator counts, and cases involving drugs, falls, drowning or poisoning had higher solve rates.
Yes, there was a slight linear relationship between the victim and perpetrator ages, with a correlation of around 0.38. I also found that fire, gun/firearm and explosives are the weapons most commonly used in mass casualty cases where total victim counts are greater than 3.
The strongest relationship I found was between Victim.Count (victims apart from the main victim) and prop_solved (solve/closure rate). The correlation between the two variables was fairly strong at 0.80. It would make sense that cases with higher victim counts are prioritized and allotted more resources.
I first group the homicide data by both State and Year and record the closure rate for every state by each year.
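The grouping step can be sketched in base R as follows. The `toy` rows are invented, and the helper column `solved` and the count column `n` are hypothetical names. `State`, `Year`, `Crime.Solved` and `prop_solved` follow the names used in this report, but this is only one possible way to do the grouping and may differ from the original chunk.

```r
# Toy stand-in for the homicide data; rows are invented for illustration.
toy <- data.frame(
  State        = c("Alaska", "Alaska", "Alaska", "Texas", "Texas", "Texas", "Texas"),
  Year         = c(1980, 1980, 1981, 1980, 1980, 1981, 1981),
  Crime.Solved = c("Yes", "No", "Yes", "Yes", "Yes", "No", "Yes")
)

# Flag solved cases as 1/0 so the group mean equals the proportion solved.
toy$solved <- as.numeric(toy$Crime.Solved == "Yes")

# One row per State-Year pair holding the closure rate.
state_year <- aggregate(solved ~ State + Year, data = toy, FUN = mean)
names(state_year)[names(state_year) == "solved"] <- "prop_solved"

# Case counts per group (same grouping, so the row order matches).
state_year$n <- aggregate(solved ~ State + Year, data = toy, FUN = length)$solved

print(state_year)
```

Keeping the case count next to the rate makes it easier to tell a real change from noise in groups with only a handful of cases.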
Below, I make a scatter and line plot of the solve rate by year, faceted by state.
From the graph above, I can see that less dense states do not always do better, nor do denser states always have a low solve rate. Many states, such as Vermont, South Dakota, Alaska and Alabama, have had their fair share of ups and downs in solve rate, whereas states like Texas and Oklahoma have been fairly consistent over the years. We can also see that a densely populated area like DC has had very low lows and very high highs. States like California and Illinois have seen a decline in solve rate, with Illinois going down especially steeply, whereas states like Washington and New Mexico have had an upward trend.
Below, I make a scatter plot of victim age against perpetrator age, with points categorized by some of the most frequent victim relationships.
When the victim is a child under 12 or 13, they are usually a son or daughter, which means the perpetrator is normally one of the parents. When the victim is a stranger or an acquaintance between their teens and mid 70s, the perpetrator is normally under their late 30s. Friends, girlfriends and wives fall along an almost linear pattern, which means these homicides happen across all ages and the perpetrators are usually about the same age as their victims. We also see that the majority of wives are killed by their husbands between the ages of the early 60s and mid 80s. I can also make out that older victims, who are usually family, are killed by family members between their teens and mid 30s.
I first group the homicide data by both Sex and Year and then record the closure rate for the grouped victims over the years.
Below, I make a scatter and line plot of the solve rate for each year, separated by male and female victims.
From the graph above, we can see that although cases with male and female victims both started at around a 73% solve rate in 1980, the solve rate for cases with male victims has gradually gone down to about 65% in 2014, while the solve rate for cases with female victims has gradually gone up to about 85%. So victim sex does seem to be a factor in solving a case, and that has become more pronounced over the years.
I first group the homicide data by both Race and Year and then record the closure rate for all the races over the years.
Below I make a scatter and line plot of the solve rate for each year grouped by victim race.
On closer look, when race is broken down by year, it definitely looks like victim race is also a factor influencing the closure of a case. While all the races started around the same solve rate of 70-75%, the solve rate for cases with black victims has steadily gone down from 75% in 1980 to about 60% in 2014, while the solve rate for cases with white victims has trended upward to about 80% in 2014. Native American/Alaskan Native victims usually have a higher solve rate than the other races, as we found earlier, and this again could be due to less densely populated areas, the involvement of Tribal police, etc. They have had their share of volatility from year to year as well, reaching as low as 70% and as high as 90%. Asian/Pacific Islander victims have also had widely varying solve rates over the years, going as low as 63% and as high as 80%.
I first group the homicide data by both Race and Year, then merge it with another grouping created just for weapons. I do this so that I can calculate the proportion of victims affected by criminals of every race based on their weapon choice.
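A minimal base R sketch of the group-then-merge idea is shown here. It assumes the shares are computed per perpetrator race and weapon, as in the heat map description. The `toy` rows, the `victims` column and the names `by_race_weapon`, `by_weapon`, `weapon_total`, `shares` and `prop_victims` are all hypothetical and may not match the original chunk.

```r
# Toy stand-in for the homicide data; rows are invented for illustration.
toy <- data.frame(
  Perpetrator.Race = c("White", "White", "Black", "Black", "Unknown", "White"),
  Weapon           = c("Knife", "Knife", "Knife", "Blunt Object", "Blunt Object", "Blunt Object"),
  victims          = c(1, 2, 1, 1, 3, 1)  # hypothetical total victims per case
)

# Victims per perpetrator race and weapon.
by_race_weapon <- aggregate(victims ~ Perpetrator.Race + Weapon, data = toy, FUN = sum)

# Victims per weapon only, renamed so the merged columns stay distinct.
by_weapon <- aggregate(victims ~ Weapon, data = toy, FUN = sum)
names(by_weapon)[2] <- "weapon_total"

# Attach each weapon's total, then compute every race's share of that total.
shares <- merge(by_race_weapon, by_weapon, by = "Weapon")
shares$prop_victims <- shares$victims / shares$weapon_total

print(shares)
```

Because each share is divided by its weapon's total, the proportions within one weapon add up to 1, which is what lets a heat map compare races within a weapon row.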
Below is a heat map adding more detail to our weapon vs victim count plot from earlier. In the following plot, I use the homicide data grouped by perpetrator race and weapon used, and fill each cell with a color that varies with the proportion of all victims killed by that weapon.
From the above plot, we find the proportion of victims affected by criminals of each race based on the choice of weapon. We found earlier that Gun/Firearms, Knives and Blunt objects are the weapons with the most victims. Looking more closely, it appears that about 34% of all victims of gun/firearm usage are at the hands of black criminals while 33% are killed by white perpetrators, and 39% and 37% of all knife-based murders are committed by white and black criminals respectively. In the case of blunt weapons, 43% of all murders are carried out by criminals who are white. So, it looks like white criminals kill their victims with blunt objects a lot more than perpetrators of other races. It is also interesting to see that when the victim dies of falling, 37% of the victim count is attributed to black perpetrators. About 85% of all victims who died because of an explosive were at the hands of a white criminal. We also see that 45% of all victims where strangulation or an unknown weapon was used have an unknown perpetrator race, meaning that these cases were most likely unsolved. It is pleasantly surprising to see that no Native American/Alaskan Native has ever used explosives in a homicide.
I create a dataframe grouping the homicide cases by Agency Type and Year and then record the solve rate for each Agency over the years.
In the following plot, I show the solve rate of each agency type over the years.
The above plot shows the solve rates over the years for all the agencies. Tribal police and regional police have had more varied results over the years, probably because they handle the fewest cases, so the proportion for a year can be skewed when it is based on only a few cases. The closure rate for cases under the State Police seems to have improved steadily over the years, while special and municipal police have experienced a decreasing trend in the proportion of cases that are solved. The solve rates for the sheriff’s department and county police have stayed about the same over the years.
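To illustrate why a small case load makes a yearly solve rate jumpy, here is a rough sketch using the standard error of a proportion, sqrt(p(1 - p) / n). The helper `prop_se`, the 70% rate and the two case loads are hypothetical numbers chosen for illustration, not values taken from the dataset.

```r
# Standard error of a proportion p estimated from n cases.
prop_se <- function(p, n) sqrt(p * (1 - p) / n)

# Hypothetical yearly case loads: a small agency type versus a large one,
# both with an underlying solve rate of 70%.
se_small <- prop_se(0.7, 20)
se_large <- prop_se(0.7, 5000)

print(round(c(small = se_small, large = se_large), 3))
```

Under these assumed numbers, the small agency's yearly rate can swing by roughly ten percentage points from chance alone, while the large agency's rate moves by well under one point.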
From the above multivariate plots, I found that the solve rate for cases with female victims has gone up while the rate for cases with male victims has gone down over time. It also appears that cases with black victims have had declining rates of closure since 1980, so these are certainly two factors affecting the chances of a case being solved. Most states have had fairly variable solve rates over time, sometimes going up and down by big margins in consecutive years. Some states, like Illinois and California, have declined in closure efficiency over the years, while others, like Texas and Washington, have stayed constant and improved respectively. Solve rates of state police have improved over the years while those of the special and municipal police have deteriorated.
The victim and perpetrator age scatterplot enhanced by relationships showed some interesting interactions. For example, when the victim is a child under 12 or 13, they are usually a son or daughter, which means the perpetrator is normally one of the parents. When the victim is a stranger or an acquaintance between their teens and mid 70s, the perpetrator is normally under their late 30s. Friends, girlfriends and wives fall along an almost linear pattern, which means these homicides happen across all ages and the perpetrators are usually about the same age as their victims. We also see that the majority of wives are killed by their husbands between the ages of the early 60s and mid 80s. I can also make out that older victims, who are usually family, are killed by family members between their teens and mid 30s. Another interesting interaction came from the heat map, which showed that no Native American/Alaskan Native has ever used explosives to kill people, and it was surprising to see that most explosive-based homicides were committed by white criminals.
I included this plot because it gave me insight into interactions between the victim age, perpetrator age and relationship variables. When the victim is a child under 12 or 13, they are usually a son or daughter, which means the perpetrator is normally one of the parents. When the victim is a stranger or an acquaintance between their teens and mid 70s, the perpetrator is normally under their late 30s. Friends, girlfriends and wives fall along an almost linear pattern, which means these homicides happen across all ages and the perpetrators are usually about the same age as their victims. We also see that the majority of wives are killed by their husbands between the ages of the early 60s and mid 80s. I can also make out that older victims, who are usually family, are killed by family members between their teens and mid 30s. It also appears that some older husbands (60s-80s) are killed by comparatively younger wives (50s-60s).
I chose this plot because it shows the growing difference in solve rate between cases with male and female victims. Both sexes started at about the same closure rate, but over time the difference has grown to nearly 20 percentage points in the proportion of cases being solved. Male victims had a solve rate of about 65% in 2014, while female victims now have solve rates of around 85%. That trend and those numbers certainly show that victim sex has an influence on the outcome of a homicide case.
I chose this plot because it shows the proportion of victims affected by criminals of each race based on the choice of weapon. It appears that about 34% of all victims of gun/firearm usage are at the hands of black criminals while 33% are killed by white perpetrators, and 39% and 37% of all knife-based murders are committed by white and black criminals respectively. In the case of blunt weapons, 43% of all murders are carried out by criminals who are white. So, it looks like white criminals kill their victims with blunt objects a lot more than perpetrators of other races. It is also interesting to see that when the victim dies of falling, 37% of the victim count is attributed to black perpetrators. About 85% of all victims who died because of an explosive were at the hands of a white criminal. We also see that 45% of all victims where strangulation or an unknown weapon was used have an unknown perpetrator race, meaning that these cases were most likely unsolved. It is pleasantly surprising to see that no Native American/Alaskan Native has ever used explosives in a homicide. For Asian/Pacific Islanders, the highest share of the victim count across the different weapons was, surprisingly, for cases involving drowning. This interaction between weapon use and perpetrator race is useful because it allows one to build part of a criminal profile, based on the weapon used, victim race, victim age and demographic information of the area, for homicide cases where the perpetrator is unknown.
While doing exploratory data analysis on this dataset, the few challenges I faced were the large number of variables and some inaccurate or wrongly reported age or relationship information. I cleaned all the entries I could, but for some I had nothing to go on. I didn’t need to change the table format from wide to long or vice versa for any of my explorations, so that went smoothly. The insights from my visualizations cover interactions based on relationship and victim and perpetrator age; differences in outcome by victim sex, race and state; and interactions among fatalities, perpetrator race and weapon choice. For example, we now know that 45-50% of all murders involving unknown weapons or strangulation have gone unsolved, and that white criminals use blunt objects, fire, poison, drugs and explosives to kill their victims a lot more than perpetrators of other races. Using historical weapon use and proportional victim count findings, and looking at the age, sex and race of the victim, we can possibly narrow down and make educated estimates while building a perpetrator profile. By finding other useful interactions, such as between victim and perpetrator race, and adding geographical and demographic information about the crime location, I believe one could build reliable criminal profiles for homicide cases where the perpetrator is unknown and/or build a model predicting the closure rate of a case from the available information. These would also shed light on closure rates in densely populated cities versus less densely populated rural/suburban areas.